Each outcome takes one of two values, so observing \(y_i\) is like flipping a coin!
Denote the probability of a positive outcome by \(\theta = \text{Pr}(y_i=1)\), where \(0\leq \theta \leq 1\). We can represent this with the following probability distribution:
\[\text{Pr}(y_i=y) = \theta^y(1-\theta)^{1-y} \]
which is known as the Bernoulli distribution:
\[\begin{equation} y_i \sim \text{Bernoulli}(\theta) \end{equation}\]
Suppose you flip the coin twice: \(y_1=1\), \(y_2=0\). Assuming independence:
\[\begin{equation} \text{Pr}(y_1=1,y_2=0|\theta) = \theta \times (1-\theta). \end{equation}\]
We call \(L(\theta)=\text{Pr}(y_1=1,y_2=0|\theta)\) the likelihood.
We want to choose \(\theta\) to maximise the probability of obtaining those results.
Find the maximum by differentiating \(L(\theta)=\theta(1-\theta)\) and setting the derivative to zero:
\[\begin{equation} \frac{d L}{d\theta} = 1 - 2 \theta = 0 \end{equation}\]
Rearranging, we obtain:
\[\begin{equation} \theta = \frac{1}{2} \end{equation}\]
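As a quick numerical sanity check, the same maximiser can be found by evaluating \(L(\theta)=\theta(1-\theta)\) on a grid (a minimal sketch; the grid size is an arbitrary choice):

```python
import numpy as np

# Likelihood of observing y1 = 1, y2 = 0 as a function of theta.
def likelihood(theta):
    return theta * (1 - theta)

# Evaluate on a grid over [0, 1] and pick the maximiser.
thetas = np.linspace(0, 1, 1001)
best = thetas[np.argmax(likelihood(thetas))]
print(best)  # 0.5
```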
In logistic regression, we use the logistic function:
\[\begin{equation} \theta = \frac{1}{1 + \exp (-x)} \end{equation}\]
We want to estimate how sensitive the presence / absence of lung cancer is to tar exposure, so we model the probability:
\[\begin{equation} \theta_i = f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_i))} \end{equation}\]
which is known as logistic regression, and we assume:
\[\begin{equation} y_i \sim \text{Bernoulli}(\theta_i) \end{equation}\]
The data for one individual, \((x_i,y_i)\), have probability:
\[\text{Pr}(y_i=y) = f_\beta(x_i)^y(1-f_\beta(x_i))^{1-y} \]
Suppose we have data \((x_1,y_1=1)\) and \((x_2,y_2=0)\), and assume the observations are independent.
Then the overall probability is just the product of the individual probabilities:
\[\begin{align} L &= f_\beta(x_1)^{y_1} (1-f_\beta(x_1))^{1-y_1} f_\beta(x_2)^{y_2}(1-f_\beta(x_2))^{1-y_2}\\ &= f_\beta(x_1) (1-f_\beta(x_2)) \end{align}\]
The same logic applies to any number of observations under the i.i.d. assumption:
\[\begin{equation} L = \prod_{i=1}^{K} f_\beta(x_i)^{y_i} (1 - f_\beta(x_i))^{1 - y_i} \end{equation}\]
Unlike the simple coin-flipping case, there is no analytic solution for the maximum likelihood estimates. Instead, we do gradient descent on the negative log-likelihood \(\ell(\beta) = -\log L\):
\[\begin{align} \beta_0 &= \beta_0 - \eta \frac{\partial \ell}{\partial \beta_0}\\ \beta_1 &= \beta_1 - \eta \frac{\partial \ell}{\partial \beta_1} \end{align}\]
where \(\eta>0\) is the learning rate.
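A minimal numerical sketch of these updates on made-up tar/cancer data (the data, iteration count, and learning rate are all assumptions). The updates add the log-likelihood gradient, which is the same as gradient descent on the negative log-likelihood:

```python
import numpy as np

# Hypothetical data: x = tar exposure, y = lung cancer (1) or not (0).
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([0, 0, 1, 0, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1, eta = 0.0, 0.0, 0.1
for _ in range(5000):
    theta = sigmoid(b0 + b1 * x)
    b0 += eta * np.sum(y - theta)        # d(log L)/d(beta_0) = sum(y - theta)
    b1 += eta * np.sum((y - theta) * x)  # d(log L)/d(beta_1) = sum((y - theta) x)

print(b0, b1)  # fitted intercept and slope
```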
Suppose we estimate that \(\beta_0=-1\) and \(\beta_1=2\). What do these mean?
\[\begin{equation} \theta_i = \frac{1}{1 + \exp (-(-1 + 2 x_i))} \end{equation}\]
so the impact of incremental changes in \(x_i\) on the probability of lung cancer is nonlinear.
From this expression for \(\theta_i\), it follows that
\[\begin{equation} 1-\theta_i = \frac{\exp (-(-1 + 2 x_i))}{1 + \exp (-(-1 + 2 x_i))} \end{equation}\]
The ratio of the probability of lung cancer to the probability of being cancer-free is called the odds:
\[\begin{align} \frac{\theta_i}{1-\theta_i} &=\exp (-1 + 2 x_i) \end{align}\]
so here a one-unit increase in \(x_i\) multiplies the odds by \(\exp 2\approx 7.4\). Because of this, \(\exp \beta_1\) is known as the odds ratio for that variable.
Taking the log of both sides:
\[\begin{equation} \log \frac{\theta_i}{1-\theta_i} = -1 + 2 x_i \end{equation}\]
so we see that \(\beta_1=2\) gives the change in the log-odds for a one-unit change in \(x_i\).
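The odds-ratio interpretation is easy to verify numerically with the estimated coefficients \(\beta_0=-1\), \(\beta_1=2\):

```python
import math

b0, b1 = -1.0, 2.0

def theta(x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    return theta(x) / (1.0 - theta(x))

# A one-unit increase in x multiplies the odds by exp(b1), regardless of x.
print(odds(1.0) / odds(0.0))  # ~7.389, i.e. exp(2)
```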
It is straightforward to extend the model to incorporate multiple regressors:
\[\begin{equation} f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_p x_{p,i}))} \end{equation}\]
for new data point \(\tilde x_i\):
Many similarity measures are possible. A common choice is the cosine similarity:
\[\begin{equation} s(x_1,x_2) = \frac{x_1 \cdot x_2}{|x_1||x_2|} \end{equation}\]
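A minimal NumPy version of this similarity:

```python
import numpy as np

# Cosine similarity: dot product normalised by the vector lengths.
def cosine_similarity(x1, x2):
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0: parallel
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: orthogonal
```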
Assume the features follow a standard multivariate normal distribution:
\[\begin{equation} \boldsymbol{x} \sim \mathcal{N}(0, I) \end{equation}\]
where \(I\in\mathbb{R}^{d\times d}\) is the identity matrix. What does the distribution of Euclidean distances between points look like as \(d\) changes?
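A small simulation suggests the answer (sample sizes and dimensions below are arbitrary choices): distances concentrate around their mean as \(d\) grows, so the relative spread shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample pairs of standard-normal points in d dimensions and measure distances.
def distance_stats(d, n=2000):
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal((n, d))
    dists = np.linalg.norm(X - Y, axis=1)
    return dists.mean(), dists.std()

for d in (2, 10, 100, 1000):
    mean, std = distance_stats(d)
    print(d, round(mean, 2), round(std / mean, 3))  # relative spread shrinks with d
```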
| patient | breathing issue (B) | high temp (T) | loss taste (L) | covid |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 |
| 3 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 1 | 0 |
| 5 | 0 | 1 | 0 | 1 |
\[\begin{equation} f(\boldsymbol{x}):=\text{Pr}(C=1|\boldsymbol{x}) := \frac{1}{|\mathcal{S}(\boldsymbol{x})|} \sum_{i\in \mathcal{S}(\boldsymbol{x})} C_i \end{equation}\]
where \(\mathcal{S}(\boldsymbol{x})\) is the set of individuals with symptoms \(\boldsymbol{x}\). For example:
\[\begin{align} f(\emptyset) &= \frac{3}{5}\\ f(B=1) &= \frac{1}{2}\\ f(B=1,T=1) &= 0\\ \end{align}\]
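These frequencies can be computed directly from the table (a small sketch; column order assumed to be B, T, L, covid):

```python
# Patient table: (B, T, L, covid).
data = [
    (1, 1, 1, 0),
    (1, 0, 1, 1),
    (0, 0, 1, 1),
    (0, 1, 1, 0),
    (0, 1, 0, 1),
]

cols = {"B": 0, "T": 1, "L": 2}

# Pr(C=1 | symptoms): average covid label over matching patients.
def f(**symptoms):
    rows = [r for r in data if all(r[cols[k]] == v for k, v in symptoms.items())]
    return sum(r[3] for r in rows) / len(rows)

print(f())          # 3/5 = 0.6
print(f(B=1))       # 1/2 = 0.5
print(f(B=1, T=1))  # 0.0
```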
\[\begin{equation} H = -\sum_{i=1}^{2} p_i \log p_i = -(p \log p + (1 - p) \log (1-p)) \end{equation}\]
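Here \(p\) is the proportion of positive cases. A minimal implementation (base-2 logs, with the convention \(0\log 0 = 0\)):

```python
import math

# Binary entropy in bits (base-2 logs), with the convention 0 log 0 = 0.
def H(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(H(0.5))              # 1.0: a fair coin is maximally uncertain
print(round(H(3 / 5), 2))  # 0.97
```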
Define \(H(D|a)\) as the conditional entropy after splitting on a given variable \(a\), for training data \(D\):
\[\begin{equation} H(D|a) = \sum_{v\in\text{vals}(a)} \frac{|\mathcal{S}(v)|}{|\mathcal{S}(\emptyset)|} H(\mathcal{S}(v)) \end{equation}\]
When we start, we have entropy (using base-2 logs):
\[\begin{equation} H(\emptyset) = -3/5\log (3/5) - 2/5 \log (2/5) \approx 0.97 \end{equation}\]
\[\begin{align} H(D|B) &= 3/5 H(B=0) + 2/5 H(B=1)\\ &= 3/5 (-2/3\log(2/3) - 1/3\log(1/3))\\ \;\;& + 2/5 (-1/2\log(1/2) - 1/2\log(1/2)) \approx 0.95 \end{align}\]
\[\begin{align} H(D|T) &= 2/5 H(T=0) + 3/5 H(T=1)\\ &= 2/5 (0)\\ \;\;& + 3/5 (-2/3\log(2/3) - 1/3\log(1/3)) \approx 0.55 \end{align}\]
\[\begin{align} H(D|L) &= 1/5 H(L=0) + 4/5 H(L=1)\\ &= 1/5 (0)\\ \;\;& + 4/5 (-1/2\log(1/2) - 1/2\log(1/2)) = 0.8 \end{align}\]
The initial entropy is \(\approx 0.97\). After splitting: \(H(D|B)\approx 0.95\), \(H(D|T)\approx 0.55\), and \(H(D|L)=0.8\),
so splitting on \(T\) gives the largest reduction in entropy and is optimal.
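The whole split-selection step can be sketched in a few lines (column order assumed to be B, T, L, covid):

```python
import math

# Binary entropy in bits, with the convention 0 log 0 = 0.
def H(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Patient table: (B, T, L, covid).
data = [(1, 1, 1, 0), (1, 0, 1, 1), (0, 0, 1, 1), (0, 1, 1, 0), (0, 1, 0, 1)]

# H(D|a): weighted entropy of the branches after splitting on column a.
def cond_entropy(col):
    h = 0.0
    for v in (0, 1):
        rows = [r for r in data if r[col] == v]
        if rows:
            p = sum(r[3] for r in rows) / len(rows)
            h += len(rows) / len(data) * H(p)
    return h

for name, col in (("B", 0), ("T", 1), ("L", 2)):
    print(name, round(cond_entropy(col), 2))  # T has the lowest conditional entropy
```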
For regression, the tree is still defined as:
\[\begin{equation} f(\boldsymbol{x}) := \frac{1}{|\mathcal{S}(\boldsymbol{x})|} \sum_{i\in \mathcal{S}(\boldsymbol{x})} y_i \end{equation}\]
but \(y_i\in\mathbb{R}\), and we split based on the reduction in the standard deviation of \(y\) as opposed to entropy.
The ensemble predicts with the majority vote over \(B\) trees:
\[\begin{equation} f(\boldsymbol{x}) = \text{mode}\{f_b(\boldsymbol{x})\}_{b=1}^{B} \end{equation}\]
where each \(f_b(\boldsymbol{x})\) is a decision tree trained on a random sample (drawn with replacement) from the original training set; there are \(B\) such samples. This process is known as “bagging”.
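A toy sketch of bagging. Everything here is an assumption for illustration: made-up 1-D data, and decision stumps (single-threshold classifiers) standing in for full decision trees:

```python
import random
from collections import Counter

random.seed(0)

# Made-up 1-D training data: label is mostly 1 when x > 0.
train = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.2, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

# A stand-in for a tree: a stump that picks the best threshold on x.
def fit_stump(sample):
    best_t, best_acc = None, -1.0
    for t, _ in sample:
        acc = sum((x > t) == (y == 1) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bagging: fit B stumps on bootstrap samples (drawn with replacement),
# then predict with the majority vote.
B = 25
stumps = [fit_stump(random.choices(train, k=len(train))) for _ in range(B)]

def predict(x):
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

print(predict(1.5), predict(-1.5))
```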
Consider estimating the mean height of individuals in a population.
Assume we start with a naive model:
\[\begin{equation} f_0 = \frac{1}{K}\sum_{i=1}^{K} y_i \end{equation}\]
Calculate the residuals:
\[\begin{equation} \hat{y}_i = y_i - f_0 \end{equation}\]
and train a new model \(f(x_i)\) on the residuals \(\hat{y}_i\), giving \(f_1\).
The new predictor becomes:
\[\begin{equation} f(x_i) := f_0 + \alpha f_1(x_i) \end{equation}\]
where \(\alpha\) is the learning rate (a hyperparameter). Now calculate the new residuals:
\[\begin{equation} \hat{y}_i = y_i - f(x_i) \end{equation}\]
and fit a second decision tree to them, giving \(f_2\); and so on. This iterative process of improving the model is known as boosting.
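The whole loop can be sketched on made-up height data. Everything here is illustrative: the data, a fixed split at \(x=2.5\) standing in for a fitted tree, and the choice \(\alpha=0.5\):

```python
# Made-up data: x = some feature, y = height in cm.
x = [1.0, 2.0, 3.0, 4.0]
y = [150.0, 160.0, 170.0, 180.0]

# f0: the naive model, just the mean of y.
f0 = sum(y) / len(y)

# A stand-in for a tree: predict the mean residual on each side of x = 2.5.
def fit_stump(x, res):
    left = [r for xi, r in zip(x, res) if xi <= 2.5]
    right = [r for xi, r in zip(x, res) if xi > 2.5]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda xi: lmean if xi <= 2.5 else rmean

alpha = 0.5  # learning rate
pred = [f0] * len(y)
for _ in range(20):
    residuals = [yi - pi for yi, pi in zip(y, pred)]  # what is still unexplained
    f = fit_stump(x, residuals)                       # fit the next model to the residuals
    pred = [pi + alpha * f(xi) for pi, xi in zip(pred, x)]

print([round(p, 1) for p in pred])  # approaches [155.0, 155.0, 175.0, 175.0]
```

Each pass fits a model to what the current ensemble gets wrong, then takes only a fraction \(\alpha\) of that correction, which is why the predictions converge gradually rather than in one step.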
boosting parameters:
tree parameters: